Abstract



We present some quantitative performance measurements for the computing power of
Programmable Active Memories (PAM), as introduced by [BRV89]. Based on Field Programmable
Gate Array (FPGA) technology, the PAM is a universal hardware co-processor closely coupled
to a standard host computer. The PAM can speed up many critical software applications
running on the host, by executing part of the computations through a specific hardware design.
The performance measurements presented are based on two PAM architectures and ten specific
applications, drawn from arithmetic, algebra, geometry, physics, biology, audio and video.
Each of these PAM designs proves as fast as any reported hardware or super-computer for
the corresponding application. In cases where we could bring some genuine algorithmic
innovation into the design process, the PAM was measured to be an order of magnitude faster
than any previously existing system (see [SBV91] and [Sku92]).



Summary



We present some quantitative performance measurements to assess the computing power
of Programmable Active Memories (PAM), introduced by [BRV89]. Based on Field
Programmable Gate Array (FPGA) technology, a PAM is a universal hardware co-processor
connected to a standard host computer. The PAM can speed up many specific software
applications. The performance measurements presented here are based on two PAM
architectures and ten specific applications, drawn from arithmetic, algebra, geometry,
physics, biology, audio and video. Each of the PAM designs presented proves faster than
any other kind of hardware or super-computer for the corresponding application. Where we
could bring algorithmic innovations into the design process, the PAM proved at least an
order of magnitude faster than all previous systems (see [SBV91] and [Sku92]).



Keywords 

Reconfigurable logic, programmable gate arrays, hardware configurable co-processor,
hardware performance measurements.

1 PAM concept

Like any RAM memory module, a PAM is attached to the system bus of a host computer. 
The processor can write into, and read from the PAM. Being an active hardware co-processor 
however, the PAM processes data between write and read instructions. The specific processing 
is determined by the content of its configuration memory. The host can change the PAM 
configuration by downloading a new design into it, within a few milliseconds. 

We speed up a specific software application running on the host, by executing its critical 
inner-loop through an appropriate hardware design downloaded into the PAM. Ten examples 
of such designs and applications are presented below. 

In selecting them, we have attempted to cover as many application areas as we could. In each, 
we picked basic and frequent enough problems, where a large inner-loop speedup through a 
specific PAM design results in a significant speedup of the application system, software and 
hardware combined. 

For the sake of specification and debugging, the functionality of each hardware design is 
matched, bit-wise at the I/O level, by a software implementation. Through successive 
refinements, both hardware and software implementations are optimized for speed, while 
retaining identical I/O behavior. This is how we derived our experimental data in this
assessment of the PAM's computing power.

The traditional measurement unit here is the Mips¹, or its multiple the Gips (1G = 10^9), which
is more appropriate to the levels of performance at hand. As explained in [HP90], quantitative
performance comparison between different computer architectures is a challenging art. First,
Mips come in different units: are they measured for the 32b MIPS R3000 or the 64b Alpha AXP
instruction set? For which clock speed, cache size and bus bandwidth? Second, in a number
of cases we have external data for both VLSI and super-computer implementations of systems
close enough to some of ours, so that we could make meaningful speed comparisons. Such
external data is typically given in Gflops (Giga floating point operations per second) for
super-computers; for VLSI, the number of transistors and the operating clock frequency are
normally quoted.

To make quantitative comparisons significant across such a wide spectrum of feasible
implementation technologies, we introduce a common unit for measuring all forms of
computing power, the Gbops (billion binary operations per second), where each useful
boolean operation with up to 4 inputs, be it clocked or not, counts for one. For example the
computing power of a standard carry-ripple 32b adder operating at 100 MHz is 6.4 Gbops,
accounting for both carries and sums. That of an n-bit multiplier at 1 GHz is $2n^2$ Gbops. Any
specific computer operation can be similarly decomposed at the bit level, so we can evaluate
its corresponding power in Gbops.
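
The adder figure can be checked with a short calculation. The sketch below is only an illustration: the function name `gbops` is hypothetical, and the count of two binary operations per adder bit (one sum, one carry) is taken from the accounting stated in the text.

```python
def gbops(binary_ops_per_cycle: int, clock_hz: float) -> float:
    """Computing power in Gbops: binary operations per cycle times clock rate."""
    return binary_ops_per_cycle * clock_hz / 1e9


# 32-bit carry-ripple adder: one sum and one carry operation per bit (assumed count).
ADDER_OPS = 32 * 2
adder_power = gbops(ADDER_OPS, 100e6)  # clocked at 100 MHz
print(round(adder_power, 3))
```

The same function applies to any operator once it has been decomposed into boolean operations of at most 4 inputs.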



¹ Million instructions per second

2 Two PAM architectures

Our assessment is based on two PAM architectures realized at DEC Paris Research Lab.,
DECPeRLe-0 (see [BRV89]) and DECPeRLe-1, which we refer to as P0 and P1 respectively in
the following.

Each PAM is built around a large array of bit-level configurable logic cells (hereafter called 
Programmable Active Bits or PABs) in which the application-specific hardware operator is 
programmed. This array is surrounded with local RAM banks used as a cache (wide and 
fast enough to match the PAM's processing bandwidth), a programmable clock generator, and 
some additional non-configurable logic to manage the host bus interface and the download 
process. 



[Figure: block diagram of P0. Labels: Ext I/O, DATA, ADR, Dwld/Rdb, CNTR, VME-Bus, Cntr, RAM.]



The above figure sketches the architecture of P0 (1988). The central computational array is
made of a 5 x 5 matrix of Xilinx XC3020 Programmable Gate Arrays [Xil87]; it has two
32-bit wide RAM banks on the south and east sides, a VME bus interface on the west side
and general-purpose interface connectors on the north side. The control and bootstrap logic is
implemented in two extra XC3020 chips (not user-programmable). Finally, additional bus
switching resources are provided for global data routing (represented here with
diamond-shaped boxes).

The architecture of P1 was designed after two years of P0 usage. It features a central
computational matrix four times bigger, with accordingly wider RAM, a faster host interface,
and a much more flexible global interconnection network for data routing and switching.

For the purpose of evaluating Xilinx-based designs such as ours, it is convenient to define the
Programmable Active Bit as in [BRV89]: a PAB consists of a universal 4-input combinatorial
gate and a synchronous flip-flop. Using this measurement unit, our two PAM architectures P0
and P1 have the following vital statistics:



PAM   Part     Number   PABs   Fmax    Power     RAM    Host bus
                               (MHz)   (Gbops)   (MB)   (MB/s)
P0    XC3020   25       3.2K   25      80        0.5    8
P1    XC3090   23       14K    40      588       4      100



This chart exhibits the three most important architectural parameters that determine which
applications benefit most from a PAM speed-up:

• The number of PABs (3.2K for P0 and 14K for P1), together with the application-dependent
maximal clock frequency (25 MHz and 40 MHz) at which we can reliably
operate. The product of these two numbers is the maximum theoretical computing
power of the PAM, expressed in Gbops (80 Gbops for P0 and 588 Gbops for P1).

• The host bus bandwidth: 8 MB/s for P0 through a VME bus, and 100 MB/s for P1
through a TURBOchannel interface [DEC91].

• The size of the local (fast) RAM: 0.5 MB for P0 and 4 MB for P1.
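
The peak figures in the chart follow from the chip counts. The sketch below recomputes them, assuming the per-chip areas of 128 PABs (XC3020) and 640 PABs (XC3090) quoted in the conclusion of this report; the names `PABS_PER_CHIP` and `peak_gbops` are hypothetical.

```python
# PABs per chip, taken from the chip areas quoted in the conclusion.
PABS_PER_CHIP = {"XC3020": 128, "XC3090": 640}


def peak_gbops(part: str, chips: int, clock_mhz: float) -> float:
    """Maximum theoretical power in Gbops: number of PABs times clock frequency."""
    pabs = PABS_PER_CHIP[part] * chips
    return pabs * clock_mhz * 1e6 / 1e9


p0_peak = peak_gbops("XC3020", 25, 25)  # 3200 PABs at 25 MHz
p1_peak = peak_gbops("XC3090", 23, 40)  # 14720 PABs at 40 MHz
print(p0_peak, p1_peak)
```

With these inputs the result for P1 is slightly above the 588 Gbops listed in the chart, which appears to be rounded down.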

3 PAM programming 

Programming an algorithm on the PAM is similar to casting it in conventional hardware 
(gate-array or VLSI), with two very important differences: first the target hardware provides 
a clean implementation of the synchronous logic model, so there is no need to worry about 
low-level electrical details; second the whole design process is entirely software operated with 
a fast turnaround time (5 to 30 minutes edit-compile-run loop), so it can be approached with 
the same methodology of piecewise testing and successive refinements as a software design. 
For a given application, it involves: 

1. identifying the critical computations which are best implemented in hardware; this is 
usually done by successive refinement, under constraints of communication bandwidth 
and load balancing between the software and hardware; 

2. implementing the hardware part on the PAM and gradually optimizing it; 

3. implementing the software part on the host processor and gradually optimizing it. 

Step 3 above is done with conventional techniques. Step 2 consists in describing the logic 
design to implement in the PAM down to the individual bit level, as well as its geometric 
layout, much as is done for a conventional VLSI design. 

For this, we have developed a tool suite in which the hardware design is described by
writing a program in a conventional programming language (we use Lisp, C++ [Tou92]
and Esterel [Ber92]) using a specialized library. This program describes the various logic
modules by their bit-level logic equations, or by using standard library modules (adders,
registers, standard interfaces...). It also contains layout (geometric) information to the
level of detail the designer wishes to specify: this usually starts with global floorplanning and
detailed layout of the important datapath components only, to be enriched with more precise
descriptions along the optimization process.
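
To give a flavour of what describing a module "by its bit-level logic equations" means, here is a minimal sketch in plain Python. It is not the authors' Lisp, C++ or Esterel library, and it carries no layout information; the function names are hypothetical. It only models a ripple-carry adder as a chain of one-bit cells, in the spirit of the bit-wise matching software implementation mentioned in Section 1.

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Bit-level equations of one adder cell: returns (sum, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_adder(x: int, y: int, width: int) -> int:
    """Chain `width` full-adder cells; the final carry-out is dropped."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result
```

Each cell uses two boolean functions of three inputs, which is consistent with counting two binary operations per adder bit.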

Executing this program produces a partially placed, hierarchical netlist which can either be 
simulated, or compiled to the final PAM configuration in a fully automatic way, through tools 
developed in-house (board-level routing, logic optimization) and standard Xilinx back-end 
software (chip-level placement and routing, bitstream generation). 

The design can then be downloaded into the PAM for debugging and testing under real 
conditions; its maximum running speed can be characterized, and its critical paths identified 
(we have developed specialized interactive tools to help visualize the latter). The designer can 
then gradually optimize it, for example by adding extra levels of pipeline to increase the clock 
speed, or optimizing the geometric layout, or adding new functionalities to further unload the 
software. The most important point there is that, as compiling and trying a new version costs 
no money and only a few minutes to an hour of time, it is possible to incrementally design, test 
and optimize the algorithms and implementations involved, as is usually done with software, 
as opposed to having to have them correct and optimal on the first try. 

4 Ten PAM applications 

The following applications were chosen to span a wide range of current leading edge 
computational challenges. In each case, we provide a brief description of the design, the 
names of the implementors, and a performance comparison with similar reported work. In 
each case, the following simple paradigm has been applied: 

compile the inner loop in PAM hardware, and let software handle the rest! 

In what follows, we let $(a \div b)$ represent the quotient and $(a \bmod b)$ the remainder in
the integer division.

4.1 Long multiplication

[Figure: long multiplier datapath. Labels: Host Data, Host Adr., A s/p Reg, B s/p Reg, S s/p Reg, P s/p Reg, A Reg, Mul. Slice (x 512 / 2K), Mul. Cntr; bus widths 2 and 32.]



We have programmed both PAMs into long multipliers (n = 512 bits for P0, and n = 2K
bits for P1) computing P = A x B + S, with A an n-bit multiplier, and B, S arbitrary-size
multiplicands and summands (see [Lyo76], [BRV89] and [SBV91]). These multipliers are
interfaced with the arbitrary-precision arithmetic package BigNum (see [SVH89]) so that any
program based on that software takes advantage of the PAM without modification, simply by
relinking with a modified BigNum library. This speeds up raw multiplication by
a factor of up to 24 (P0) and 30 (P1) for long operands, as compared to optimized assembly
code running on the host workstation. For example, P0 equipped with this design computes RSA
encryption/decryption at 1500 bits per second for arbitrary 512-bit keys (see [RSA79]); this is
about 10 times faster than our best software version on the same host.

The P1 implementation produces product bits at 66 Mbits/s, which makes it faster than any
known machine for which we could obtain benchmark measures. It is at least 16 times faster
than the best figures for a Cray II or a Cyber 170/750 reported by [BW89]. This multiplier can
be used to compute a 50-coefficient (16-bit) polynomial convolution (FIR filter) at 16 times
audio real time (2 x 24-bit samples at 48 kHz).
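
As an illustration of how software might build arbitrary-size products on top of an operator whose multiplier A is limited to n bits, consider the sketch below. It is an assumption about one possible decomposition, not the BigNum interface itself; `pam_multiply_add` merely stands in for the hardware operation P = A x B + S, and both function names are hypothetical.

```python
def pam_multiply_add(a: int, b: int, s: int, n: int) -> int:
    """Software stand-in for the hardware operation P = A x B + S, A limited to n bits."""
    assert 0 <= a < (1 << n), "A must fit in n bits"
    return a * b + s


def long_multiply(a: int, b: int, n: int = 512) -> int:
    """Multiply non-negative integers of any size by feeding n-bit slices of A."""
    mask = (1 << n) - 1
    product, shift = 0, 0
    while a:
        # Each slice contributes (slice x B), shifted to the slice's position.
        product += pam_multiply_add(a & mask, b, 0, n) << shift
        a >>= n
        shift += n
    return product
```

For example, `long_multiply(3**1000, 7**900)` splits the first operand into four 512-bit slices and issues one multiply-add per slice.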



4.2 RSA cryptography 



To further investigate the tradeoffs which are possible in our hybrid hardware/software system
we focused on the RSA cryptosystem (see [RSA79]), which can be cast entirely in terms of
long multiplications. Starting from the above general-purpose multiplier, M. Shand from
DEC-PRL implemented a series of hardware/software systems spanning two orders of magnitude
in performance. The latest version is based on an original hardware design for computing
modular products at the rate of two bits per cycle [SV93].

The system originally used three differently programmed P0 boards, all operating in parallel
with the host (see [SBV91]). At 200 kbit/s decoding speed, it was faster than any existing
512-bit RSA implementation, in any technology, as of February 1990. A survey by
[Bri90] grants the previous speed record for 512-bit key RSA decryption to a VLSI from
AT&T, at 19 kbit/s.

M. Shand recently ported this RSA system to a single P1 board; at 40 MHz, this design
provides either two independent 600 kbit/s RSA encryption channels for 480b keys, or one
175 kbit/s RSA encryption channel for 970b keys.
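
The remark that RSA can be cast entirely in terms of long multiplications can be made concrete with the textbook square-and-multiply method, sketched below. This is the standard algorithm, not the two-bits-per-cycle modular product design of [SV93]; the toy key is far too small for real use and is chosen only for illustration.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right square-and-multiply: each step is one long (modular) product."""
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus    # square
        if bit == "1":
            result = (result * base) % modulus  # multiply
    return result


# Toy key for illustration only: n = 61 * 53, public exponent e, private exponent d.
n, e, d = 3233, 17, 2753
cipher = mod_exp(65, e, n)
plain = mod_exp(cipher, d, n)
```

A k-bit exponent costs k squarings and at most k multiplications, which is why the speed of the long multiplier dominates the RSA throughput.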

4.3 Data compression 

M. Skubiszewski from DEC-PRL has implemented a P0 design to speed up the algorithm of
[ZL77], which is well known to achieve an average data compression ratio varying from 2 for
English (or French, or Polish...) plain text to 3 for C (or Lisp, or Pascal...) source code.

The design is a massively parallel method which computes 64 byte comparisons on each
(70 ns) cycle; it matches the next 16 bytes in the file to be compressed against the last 4K
bytes seen (stored in the local RAM), in order to detect the longest substring previously seen.
While this design performs a respectable 1 Gops (8-bit integer comparisons), it yields
a disappointing factor-of-two speed-up when compared to optimized software such as Unix
compress. Indeed, such optimized software avoids most of the comparisons performed in
the hardware, by detecting early that they are irrelevant to the final output. A more elaborate
hardware design is needed to genuinely speed up this particular algorithm.
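
The search performed on each step can be written as a brute-force loop, sketched below under stated assumptions: a 4096-byte window, a 16-byte lookahead, and matches restricted to bytes already seen. The function name and the tie-breaking rule (earliest candidate wins) are hypothetical; the hardware evaluates many of these comparisons in parallel rather than one at a time.

```python
def longest_match(data: bytes, pos: int, window: int = 4096, lookahead: int = 16):
    """Return (offset, length) of the longest earlier match for data[pos:pos+lookahead]."""
    start = max(0, pos - window)
    end = min(len(data), pos + lookahead)
    best_offset, best_length = 0, 0
    for candidate in range(start, pos):
        length = 0
        # Extend the match while bytes agree, staying inside the bytes already seen.
        while (pos + length < end
               and candidate + length < pos
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - candidate, length
    return best_offset, best_length
```

This exhaustive version performs every comparison, which is exactly the work that optimized software avoids by pruning irrelevant candidates early.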

4.4 String matching

Given an alphabet $A = (a_1, \ldots, a_n)$, a probability $(S_{ij})_{i,j=1 \ldots n}$ of
substitution of $a_i$ by $a_j$, and a probability $(I_i)_{i=1 \ldots n}$ (resp.
$(D_i)_{i=1 \ldots n}$) of insertion (resp. deletion) of $a_i$, one can use a
classical dynamic programming algorithm to compute a probability of transformation of $w_1$
into $w_2$; this defines a distance between any two words $w_1$ and $w_2$ over $A$.





[Figure: 30K Words Dictionary; chain P1 -> P2 -> P3 -> P4 -> P5; Coefficients.]





D. Lavenier from IRISA (Rennes, France) has implemented this algorithm with a P0 design
which computes the distance between an input word and all 30K words in a dictionary; it
reports the k words found in the dictionary which are closest to the input. The system processes
200K words/sec: this is faster than a solution previously implemented at CNET using 12
Transputers; it has only half of the performance obtained by a system previously developed at
IRISA, based on 28 custom VLSI chips and 2 PC boards.

Applications of this algorithm include automated mail sorting using OCR scanners, on-the-fly
keyboard spelling corrections, and DNA sequence matching (see [Lop87]).
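
The classical dynamic programming algorithm referred to above can be sketched as follows. The sketch works with additive costs rather than probabilities (a cost can be read as a negative log-probability), and its default unit costs are an assumption chosen for illustration; the function names `edit_distance` and `closest` are hypothetical.

```python
def edit_distance(w1, w2,
                  sub=lambda a, b: 0 if a == b else 1,
                  ins=lambda a: 1,
                  dele=lambda a: 1):
    """Weighted edit distance between w1 and w2, computed row by row."""
    prev = [0] * (len(w2) + 1)
    for j, b in enumerate(w2, 1):
        prev[j] = prev[j - 1] + ins(b)
    for a in w1:
        cur = [prev[0] + dele(a)]
        for j, b in enumerate(w2, 1):
            cur.append(min(prev[j] + dele(a),         # delete a from w1
                           cur[j - 1] + ins(b),       # insert b into w1
                           prev[j - 1] + sub(a, b)))  # substitute a by b
        prev = cur
    return prev[-1]


def closest(word, dictionary, k):
    """Return the k dictionary words nearest to `word` (ties keep dictionary order)."""
    return sorted(dictionary, key=lambda entry: edit_distance(word, entry))[:k]
```

For instance, `closest("helo", ["hello", "world", "help"], 2)` scans the whole dictionary once, which mirrors the exhaustive comparison against all 30K words described above.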



4.5 Heat and Laplace equations 



[Vui93] shows how to adapt the classical finite difference method (see [FLS63]) to compute
solutions of the heat and Laplace equations in n dimensions with help from special purpose
hardware. An implementation of the method on P1 operates with a pipeline depth of 128
operators:



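[Figure: pipeline of operators placed between two RAM banks.]

Each operator computes:

$$\Phi(v_0, v_1) = \begin{cases} v_0, & \text{when } v_0 \bmod 2 = 1, \\ 2\,(v_0 \div 4 + v_1 \div 4 + (v_0 \bmod 4) \div 2), & \text{otherwise,} \end{cases} \qquad (1)$$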



all with 24b fixed-point data format. At 20 MHz, this amounts to 5 Gops (24b adds, tests and
shifts); it is easy to show (see [Vui93]) that fixed-point gives the same results as floating-point
operations for this specific problem; the achieved performance thus exceeds those reported
in [McB89] and [MF+91] for solving the same problem with super-computers. A sequential
computer needs to execute 25 billion instructions per second (25 Gips) to reproduce the same
computation.
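
A software model of a single operator is easy to write once a reading of (1) is fixed. The sketch below assumes that the two division symbols in (1) denote the integer quotient and the remainder, and that an odd first argument is passed through unchanged while an even one is replaced by a rounded average that stays even. This reading, the function name, and the restriction to non-negative values are all assumptions.

```python
def heat_operator(v0: int, v1: int) -> int:
    """One operator of (1) on non-negative fixed-point integers (assumed reading)."""
    if v0 % 2 == 1:
        # Odd value: passed through unchanged.
        return v0
    # Even value: average of v0 and v1, computed with shifts; the result stays even.
    return 2 * (v0 // 4 + v1 // 4 + (v0 % 4) // 2)
```

Under this reading the operator needs only adds, tests and shifts, which matches the operation mix quoted in the text.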

The heat and Laplace equations have many applications in mechanics, circuit technology, fluid 
dynamics, electrostatics, optics, finance, and so on. 



4.6 Newton's mechanics 



J. Vuillemin has specified a P1 design for computing the evolution of an n-body system, using
Newton's equations. The design computes the gravitational field acting on body k by summing
the individual fields induced at k by each other body in the system. This amounts to the
following 18 operations:



(xi - xk) => dx       (dx2 + dz2) => dxz     fm x dx  => fdx
(yi - yk) => dy       (dxz + dy2) => d2      fm x dy  => fdy
(zi - zk) => dz       sqrt(d2)    => d       fm x dz  => fdz
(dx x dx) => dx2      d x d2      => d3      fx + fdx => fx
(dy x dy) => dy2      1 / d3      => fd      fy + fdy => fy
(dz x dz) => dz2      fd x mi     => fm      fz + fdz => fz



Positions and forces are represented as 20 bit floating-point numbers. Assuming a 40 ns 
internal cycle (achievable through deep pipe-line) the expected throughput exceeds 2.5 GFlops 
(this design has not been tested at publication time). 
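
The 18 operations translate directly into a software loop, sketched below. The sketch uses ordinary double-precision floats instead of the 20-bit format, omits the gravitational constant and the mass of body k, and assumes that the step producing fd is the reciprocal of d3; the function name and the tuple layout (x, y, z, m) are hypothetical.

```python
import math


def accumulate_field(bodies, k):
    """Sum the fields induced at body k by every other body; bodies hold (x, y, z, m)."""
    xk, yk, zk, _ = bodies[k]
    fx = fy = fz = 0.0
    for i, (xi, yi, zi, mi) in enumerate(bodies):
        if i == k:
            continue
        dx, dy, dz = xi - xk, yi - yk, zi - zk           # 3 subtractions
        dx2, dy2, dz2 = dx * dx, dy * dy, dz * dz        # 3 squarings
        dxz = dx2 + dz2
        d2 = dxz + dy2
        d = math.sqrt(d2)
        d3 = d * d2
        fd = 1.0 / d3                                    # assumed reciprocal step
        fm = fd * mi
        fdx, fdy, fdz = fm * dx, fm * dy, fm * dz        # 3 products
        fx, fy, fz = fx + fdx, fy + fdy, fz + fdz        # 3 accumulations
    return fx, fy, fz
```

Each pass through the loop body performs the 18 operations once, so a system of n bodies needs n - 1 passes per body and per time step.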




4.7 Binary 2D convolution 

B. Chen and J. Vuillemin have implemented a 7 x 7 binary 2D convolver on P0, for performing
erosion, dilation and matching on black and white images, as defined in [Ser82].

The convolver runs at 25 MHz, generating one pixel each 40 ns; it completes a single
convolution pass over one 512 x 512 image in 10 milliseconds; this allows for up to
4 successive operations (erosion, dilation, or matching) at video rate. Reproducing this
performance through optimized software would require a 200 Mips computer.
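
As an illustration of one of these operations, the sketch below implements binary erosion with a square window and recomputes the duration of one pass. The full square structuring element and the convention that pixels outside the image count as 0 are assumptions; the operators of [Ser82] are more general, and the function name is hypothetical.

```python
def erode(image, size=7):
    """Binary erosion by a size x size square; pixels outside the image count as 0."""
    h, w, r = len(image), len(image[0]), size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # The output pixel is 1 only if every pixel under the window is 1.
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and image[y + dy][x + dx]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)))
    return out


# One pass over a 512 x 512 image at one pixel per 40 ns, in seconds.
pass_time = 512 * 512 * 40e-9
```

In software each output pixel costs up to 49 tests, whereas the hardware delivers one output pixel per clock cycle.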



4.8 Boltzmann machine 



M. Skubiszewski has implemented two successive versions of a hardware emulator for binary 
neural networks, based on the Boltzmann machine model (see [Sku90] and [Sku92]). 

The Boltzmann Machine is a probabilistic algorithm which minimizes quadratic forms over
binary variables, i.e. expressions of the form

$$E(N) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} w_{ij} N_i N_j \qquad (2)$$

where $N = (N_0, \ldots, N_{n-1})$ is a vector of binary variables and $(w_{ij})_{0 \le i,j < n}$
is a fixed matrix of weights. It is typically used to find approximate solutions to NP-hard
problems, such as graph partitioning or circuit placement.



[Figure: Boltzmann machine datapath. Labels: Weight Ram, Inputs, Data Ram.]

The latest realization, on P1, can solve problems with up to 1400 variables, using 16-bit
weights, for a total computing power of 500 megasynapses per second (the megasynapse is
the traditional unit used in this field; it amounts to one million additions and multiplications,
or one million terms of (2)).
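
A software sketch of the computation helps to see where the synapses are spent. Below, the quadratic form (2) is taken to be the double sum of w[i][j] * n[i] * n[j] over binary variables, and a single probabilistic update uses the Metropolis acceptance rule; that rule and all function names are assumptions made for illustration, not a description of the hardware design.

```python
import math
import random


def energy(w, n):
    """Quadratic form: sum over i, j of w[i][j] * n[i] * n[j]."""
    size = len(n)
    return sum(w[i][j] * n[i] * n[j] for i in range(size) for j in range(size))


def flip_delta(w, n, k):
    """Change in energy if binary variable n[k] is flipped."""
    new = 1 - n[k]
    cross = sum((w[k][j] + w[j][k]) * n[j] for j in range(len(n)) if j != k)
    return (new - n[k]) * cross + w[k][k] * (new * new - n[k] * n[k])


def anneal_step(w, n, temperature, rng=random):
    """Flip one random variable if it lowers the energy, or with Boltzmann probability."""
    k = rng.randrange(len(n))
    delta = flip_delta(w, n, k)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        n[k] = 1 - n[k]
```

Evaluating `flip_delta` costs on the order of n multiply-add terms, which is why throughput in this field is counted in terms of (2) per second.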

4.9 3D Geometry

H. Touati from DEC-PRL has implemented a 3-D graphic accelerator for P1, which supports
translation, rotation, clipping and perspective projection, to directly compute the screen image
of a cloud of points in 3-D space.

At 25 MHz, it has a peak performance of 1.56 million points per second, using 16-bit fixed
point coordinates for the input and output, and up to 32 bits for the intermediate results. One
needs a 300 Mips processor to achieve the same throughput in software.

4.10 Discrete cosine transform 

This design (by J. Vuillemin and D. Martineau on P1) compresses a video stream in real time
through multi-dimensional fast discrete cosine transform. The fDCT implements the following
network:

[Figure: fDCT network.]

The overall design computes 48 fixed-point (32-bit outputs) operations (add, subtract, multiply
and shift) on each 40 ns cycle, for a total of 1.4 Gops. To match this performance through
software would require a 15 Gips processor.

5 Conclusion 

The following chart summarizes the practical PAM performance achieved by each of our ten 
designs: 



Design           PABs   MHz   Gbops   Gops    Gips   PAM
Multiplier       8k     33    264     0.8     2.6    P1
RSA              8k     32    256     0.5     4      P1
DCT              10k    25    250     1.4     15     P1
Newton           10k    25    250     2.5†    5      P1
Laplace          10k    20    200     7.5     20     P1
Boltzmann        8k     25    200     1       1.5    P1
3D Geometry      3k     25    75      0.5     0.7    P1
Ziv-Lempel       3k     15    45      1       2      P0
String           3k     10    30      0.15    0.3    P0
2D convolution   1k     25    25              1      P0

† These are Gflops, with 20-bit floating-point numbers.



The applications are ranked according to the most reliable performance measure, namely the 
Gbops. As a comparative measure of resource utilisation in such systems, the following table 
charts the maximum theoretical performance of generic PAM hardware (in Gbops) obtained 
by multiplying the maximal clock frequency (in MHz) by the area (in PABs): 



PAM      Area   1 MHz   20 MHz   50 MHz
XC3020   128    0.1     2.5      6.4
XC3090   640    0.6     12.8     32
P0       3.2K   3.2     64
P1       14K    14      280      700



Three years of PAM design lead us to believe the following:

1. For each of the chosen applications, we have shown that the level of performance
achieved with the PAM is comparable to the best figures reported using super-computers
or custom silicon circuits.

Our applications have been carefully selected for having a clearly identified (PAM
implementable) inner-loop, which accounts for a vast percentage of the software
run-time. For such low level processing, the PAM proves more cost effective than any
super-computer.

Due to their software complexity, many current super-computer applications still remain
outside the possibilities of current PAM technology.

2. Each mentioned PAM design was implemented and tested within one or two months,
starting from the delivery of the specification software. This is roughly equivalent to the
time it takes to implement a highly optimized software version of the same system on
a super-computer; both are technically challenging, yet remain an order of magnitude
faster than the time it takes to cast a system into silicon.

3. The cost of P1 is comparable to that of a high-end workstation. This is orders of
magnitude lower than the cost of a super-computer. Based on figures from [McB89],
we find that the price (in $ per operation per second) of solving the heat and Laplace
equations is 100 times higher with super-computers than with the PAM.

4. Another field of applications, not covered by any existing super-computer, is open to PAM
technology: high-bandwidth interfaces to the external world, with fully programmable
real-time capabilities. The P1 PAM has a 256b wide connector, capable of delivering up
to 6.4 Gb/s of external bandwidth. It is a "simple matter of hardware programming"
to interface directly with any electrically-compatible external device, by programming
its communication protocol into the PAM itself. Applications for this capability are
numerous, including interfaces to high-bandwidth networks, audio and video input or
output devices, and on-the-fly data acquisition and filtering.